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Abstract 


:a1  solution  of  many  problems  in  continuum  dynamics  is 
se^'iously  limited  by  the  computation  rates  attainable  on  computers  with  serial 
architecture.  Parallel  processing  machines  can  achieve  much  higher  rates. 
However,  applying  additional  processors  to  a  calculation  is  only  part  of  the 
solution.  In  this  report,  parallel  algorithms  are  developed  for  explicit  and 
implicit,  Lagrangian  and  Eulerian  finite  difference  schemes  for  computational 
continuum  dynamics  in  one  spatial  dimension. 

First,  the  explicit  conservation  equations  in  the  Lagrangian  refe'^ence 
frame  are  readily  reformulated  for  concurrent  processing.  Second,  an  implicit 
solution  is  derived  for  these  equations.  This  is  important  because  it  yields 
unconditional  stability.  The  parallelism  is  achieved  via  a  block  Implicit 
ninerical  scheme.  Third,  a  rezoning  algorithm  Is  employed  with  each 
Lagrangian  integration  step  to  transform  the  mesh  back  to  the  Eulerian 
reference  frame.  Along  the  algorithmic  development  path,  a  zone-by-zone 
parallelization  gives  way  to  a  block -by -block  technique  both  of  which  are 
self-scheduling.  Then  the  latter  is  compared  to  an  approach  that  keeps  the 
parallel  processes  alive  for  many  time  steps.  At  each  step  of  this  research 
project,  the  derived  numerical  methods  provide  effective  algorithms  for 
exploiting  the  architectural  advantages  of  the  HEP  HIOOO  (Heterogeneous 
Element  processor)  computer. 
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Section  I  -  Introduction 


The  conservation  laws  of  volume,  mass,  momentum,  and  energy  apply  to  any 
continuum  material;  solid,  liquid,  gas,  plasma,  or  multiphase.  Hence,  the 
algorithms  of  computational  continuum  dynamics  are  very  important  for  the 
solution  of  many  scientific  problems.  When  the  application  is  changed  from 
one  material  to  another,  only  the  material  law  (equation  of  state,  constitu¬ 
tive  relation,  or  rate  relation)  changes.  Thus,  there  is  a  similarity  of 
structure  between  the  hydrocodes  (gas  and  liquid  dynamics),  the  wavecodes 
(solid  dynamics),  and  the  magnetohydrocodes  (plasma  dynamics)  that  are  the 
computer  in^lementations  of  schemes  for  continuum  dynamics  calculations. 

It  is  well  known  that  computer  simulation  codes  are  cost-effective  tools 
in  continuum  dynamics  research.  Indeed,  a  variety  of  problems  arising  in  such 
fields  as  aeronautics,  controlled  fusion,  meteorology,  reactor  safety,  and 
structural  analysis  provide  strong  motivation  for  the  development  of  higher 
computing  rates.  However,  the  limits  of  current  computing  power  prevent  the 
simulation  of  many  Important  problems  to  the  desired  levels  of  temporal  and 
spatial  resolution.  The  speed  of  light  barrier  imposes  a  theoretical  limit  on 
what  can  be  achieved  with  serial  architecture. 

To  achieve  higher  computing  rates  It  has  become  necessary  to  perform 
calculations  in  parallel.  The  computer  architecture  with  the  greatest  degree 
of  parallelism  is  labeled  Multiple  Instruction  stream.  Multiple  Data  stream 
(MIMD).  An  example  of  a  machine  of  this  type  Is  the  HEP  HIOOO  computer  manu¬ 
factured  by  the  Denelcor  Corporation. 

In  principle,  many  hydrocodes  and  wavecodes  could  be  moved  to  a  parallel 
processor.  However,  applying  additional  processors  to  a  computational  task  Is 
not,  in  general,  sufficient  to  produce  significant  speed-up.  Indeed,  the 
development  of  parallel  algorithms  is  an  area  of  research  vital  to  the 
effectiveness  of  parallel  processors.^  Recent  research  indicates  that  the 
parallelization  of  a  program  should  be  organized  from  the  top  down.^*^  That 
Is,  the  existing  structure  and  organization  of  a  program  do  not  permit  signif¬ 
icant  Improvements  In  speed.  Consequently,  It  becomes  necessary  to  reformu¬ 
late  algorithms  and  to  write  new  code. 


The  direct  approach  to  the  construction  of  parallel  algorithms  for  con¬ 
tinuum  dynamics  calculations  can  be  quite  complicated.  Rather  than  plunge 
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into  a  development  project  intended  to  generate  a  code  with  broad  three- 
dimensional  capabilities,  we  have  taken  a  step  by  step  approach  suggested  by 
Darrell  Hicks. ^  Thus,  in  a  modular  fashion,  the  algorithms  at  each  level  of 
complexity  can  be  verified  as  they  are  derived.  Our  approach  to  the  numerical 
solution  of  the  problems  of  continuum  dynamics  leads  to  algorithms  well  suited 
for  parallel  architecture  in  general  and  for  the  HEP  HIOOO  computer  in  partic¬ 
ular.  The  approach  is  a  step-by-step  procedure  based  on  a  progression  from 
the  simplest  hydrocode  (one-dimensional,  single-phase,  explicit,  Lagrangian) 
through  the  most  complex  continuum  dynamics  codes.  The  latter  programs  can 
involve  two  or  three  dimensions,  multiphase,  impl icit-optional -expl icit 
differencing,  arbitrary  rezoning  coordinate  systems  (Lagrangian  or  Eulerian), 
or  variable  time  steps  from  spatial  tone  to  zone. 

The  specific  objectives  of  the  research  reported  in  this  paper  Involve  a 
three-step  process  for  the  development  of  parallel  algorithms  for  one¬ 
dimensional  simulation  codes.®  First,  we  will  consider  an  explicit,  one¬ 
dimensional,  single-phase  Lagrangian  hydrocode.  Its  inherently  simple  data 
structure  makes  it  straightforward  to  integrate  the  volume,  momentum,  and 
energy  equations  for  each  zone,  or  block  of  zones,  in  parallel.  Thus,  for 
this  case,  the  optimal  parallelization  problem  is  evidently  easily  solved.  We 
show  that,  in  some  sense,  this  is  the  best  restructuring  of  the  algorithm. 
Second,  a  block  in^licit  method  is  derived  for  the  implicit  differencing  of 
the  equations.  Third,  we  will  convert  the  programs  from  Lagrangian  to 
Eulerian  coordinates  in  such  a  way  that  the  conserved  quantities  are  preserved 
in  a  parallel  processing  scheme. 

Extensions  to  Eulerian  codes  can  be  achieved  ty  a  rezoning  technique.  It 
has  precursors  in  the  work  of  F.  Harlow®  (Particle  In  Cell  codes)  and  in  the 
work  of  W.  Johnson  (PIC  codes  converted  to  continuum  simulations).  We  refer 
to  this  method  as  the  Harlow-Johnson  rezoning  technique.®  It  can  be  modified 
to  achieve  dynamic  rezoning  (also  known  as  adaptive  mesh)  methods.  (See 
Hicks^  and  Hicks  and  Walsh^  for  further  details.) 

The  results  of  this  work  provide  the  foundation  for  extensions  to  more 
elaborate  hydrocodes.  A  direct  extension  appears  to  be  possible  for  two-phase 
or  two-material  flow  in  Eulerian  coordinates.^  Through  stages  of  increasing 
coa^lexity,  one  can  achieve  the  block -by -block  parallelization  of  multiphase 
(or  multi -material)  and  multi -dimensional  hydrocodes.  It  appears  that 


block -by -block  parallelization  or  some  variation  thereof  will  lead  to  the 
optimal  parallel  algorithms.  Two-phase  flow  extensions  are  readily  accomp¬ 
lished  in  an  Eulerian  coordinate  system.  Discussions  of  two-phase  flow  models 
and  their  relevance  to  reactor  safety  may  be  found  in  Hicks^*®*^*^®*^^  and 
Ransom  and  Hicks^^.  One  of  the  important  problems  in  reactor  safety  is  the 
need  for  fast  simulators  and  predictors  to  assist  operators  In  handling 
situations  such  as  the  event  at  Three  Mile  Island.  Finally,  extensions  to 
mul ti -dimensions  may  be  achieved  by  operator  splitting. 

While  the  development  of  hydrocodes  is  a  long  standing  achievement,  it  is 
the  utilization  of  new  and  unique  advances  in  computer  architecture  for  hydro¬ 
code  calculations  that  makes  our  research  important  and  timely.  In  each  of 
the  three  steps,  the  crucial  questions  concern  the  extent  to  which  the 
algorithms  can  be  separated  into  calculations  that  are  performed  concur¬ 
rently.  The  algorithmic  details  are  developed  in  Section  III.  Converting  to 
Eulerian  coordinates  is  the  topic  of  Section  IV,  The  computational  results 
are  presented  in  Section  V. 

The  design  of  a  parallel  algorithm  is  interrelated  with  the  particular 
architecture  of  the  parallel  processor. That  is,  although  some  progress 
has  been  made,^^  general  specifications  for  machine-independent  algorithms 
have  not  yet  been  agreed  upon.  In  Section  11,  we  begin  by  providing  a 
description  of  the  architectural  implications  for  parallel  algorithms  on  the 
HEP  computer. 

Section  n  -  The  HEP  CoMputer 

If  we  categorize  computer  architectures  by  their  parallel  processing 
capabilities,  we  have:  SISO,  STMO,  and  MIMO.  SISD  stands  for  Single  Instruc¬ 
tion  stream.  Single  Data  stream.  The  typical  serial  computer  has  SISD  archi¬ 
tecture.  SIMO  stands  for  Single  Instruction  stream.  Multiple  Data  stream. 
Vector  machines  such  as  the  CRAY-1  have  this  architecture.  The  multiple  data 
streams  consist  of  the  components  of  the  vectors.  The  Instruction  mode  is 
still  single  stream  (or  serial)  although  the  instructions  may  generate  vector 
operations.  MIMO  stands  for  Multiple  Instruction  stream.  Multiple  Data  stream. 
The  HEP  (Heterogeneous  Element  Processor)  by  Oenelcor  has  MIMO  architecture. 

The  HEP  computer  is  designed  to  combine  from  one  up  to  16  Process  Execu¬ 
tion  Modules  (PEM's)  in  a  single  computer.  Each  PEM  is  an  eight-segment  pipe¬ 
lined  processor  consisting  of  eight  Function  Units  in  the  Instruction 
Processing  Unit.  The  Function  Units  include  the  hardware  for  initiating 
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parallel  processing  and  for  an  integer  addition,  a  floating  point  addition,  a 

multiplication,  and  several  divides.  The  machine  has  a  64-bit  word  length. 

More  information  on  the  HEP  architecture  and  on  its  applications  can  be  found 

?  3 

in  the  references  by  Smith'-  and  Jordan. 

HEP  Software 

From  the  software  point  of  view,  the  HEP  achieves  its  parallel  processing 
capabilities  by  extending  FORTRAN  77  In  two  ways.  With  their  first  implemen¬ 
tation  of  a  compiler,  Oenelcor  provided  the  CREATE  statement  and  the  asynchro¬ 
nous  variable.  In  their  more  recent  release,  they  have  replaced  these  with 
callable  subroutines  which  make  the  FORTRAN  portable.  Even  though  this  is  a 
good  objective,  the  authors  deplore  the  change  of  heart  at  Oenelcor.  The 
CREATE  statement  and  the  asynchronous  variable  gave  the  programmer  more 
explicit  language  for  coding  calculations  that  are  intended  to  be  performed 
concurrently.  In  any  event,  a  discussion  of  these  two  terms  is  appropriate 
for  this  report  because  it  provides  a  good  explanation  of  the  considerations 
that  must  be  taken  into  account  when  processing  is  divided  into  and  reunited 
from  parallel  paths. 

1.  The  CREATE  SUBROUTINE  statement  is  similar  (syntactically)  to  the  well- 
known  CALL  SUBROUTINE  statement  in  FORTRAN.  It  has  the  effect  of  creating 
a  copy  of  (or  "cloning")  the  original  subroutine  and  executing  the  copy  in 
a  calculational  stream  parallel  to  the  mainstream. 

2.  The  asynchronous  ( “dol 1 ar-si gn" )  variable  is  any  acceptable  FORTRAN 

variable  name  prefixed  with  a  Asynchronous  variables  are  used  for 

communication  between  parallel  computational  streams.  Asynchronous 
variables  have  two  states,  "full"  and  "empty".  If  a  FORTRAN  assignment 
statement  contains  an  asynchronous  variable  on  the  right  hand  side  of  the 
equal  sign,  then  the  calculation  waits  until  the  state  of  the  asynchronous 
variable  is  full.  If  its  state  is  full,  then  the  value  is  fetched  and  the 
state  is  set  to  empty.  If  the  left  hand  side  of  the  equal  side  of  a 
FORTRAN  assignment  statement  is  an  asynchronous  variable,  then  the  assign¬ 
ment  of  its  value  waits  until  its  state  is  empty.  Then,  when  the  assign¬ 
ment  is  made,  its  state  Is  reset  to  full. 

The  diagram  In  Figure  1  presents  a  visual  image  of  the  parallel  process¬ 
ing  as  implemented  on  the  HEP  computer.  At  CREATE  statements,  the  computa¬ 
tional  flow  can  “fork"  into  a  number  of  parallel  paths  which  the  operating 
system  assigns  to  the  available  processors.  An  empty  asynchronous  variable 
prevents  the  main  stream  from  continuing  beyond  the  point  at  which  it  needs 
the  results  from  the  parallel  streams.  A  "join"  point  Is  marked  by  the  return 
from  a  CREATE  statement  and  an  asynchronous  variable  that  must  be  reset  to 
full  by  the  last  parallel  stream  to  finish  processing. 


Figure  1.  Fork  and  join  points  in  a  FORTRAN  program  mark  the  beginning  and 
end,  respectively,  of  concurrent  (or  parallel)  processing  segments  of  the 
code. 


One  of  the  crucial  aspects  of  parallel  processing  Is,  of  course,  the 
development  of  software  capable  of  coordinating  concurrent  computational 
tasks.  Oenelcor  has  chosen  a  straightforward  extension  of  FORTRAN  for  the 
HEP.  The  "fork"  and  "join"  procedures  make  the  HEP  computer  Immediately 
accessible  to  the  traditional  scientific  programming  community.  A  collection 
of  lectures  on  various  aspects  of  concurrent  computation  contains  further 
background  mate'^lal  on  parallel  processing  algorithms  and  architecture.^^  The 
HEP  FORTRAN  77  User's  Guide  is  useful  for  more  specific  Information  on  the  HEP 
software.^® 

Section  III  -  The  Parallel  AlgorlthMS 

The  first  step  of  the  research  is  the  application  of  the  parallel  process¬ 
ing  attributes  of  the  HEP,  as  described  above,  to  an  explicit,  one-dimensional 
Lagranglan  hydrocode.  An  examination  of  the  theory  shows  that  a  tone-by-zone 
parallelization  is  straightforward.  A  block -by -block  parallelization  may  be 
more  efficient  than  the  zone-by-zone  when  the  number  of  processes  Is  signifi¬ 
cantly  less  than  the  number  of  zones.  By  a  "block"  we  mean  a  set  of 
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contiguous  zones.  The  block  (or  granularity)  size  is  found  by  dividing  the 
number  of  zones  by  the  number  of  processes. 

A  hydrodynamic  simulation  is  based  on  the  conservation  laws  of  volume, 
mass,  momentum,  and  energy.  In  the  Lagrangian  frame,  the  corresponding 
differential  equations  are  differenced  on  a  mesh  in  which  the  grid  points 
remain  fixed  in  the  material.  Let  X  be  the  Lagrangian  spatial  coordinate  (the 
one  that  identifies  the  initial  location  of  the  mass  point)  and  let  g  he  the 
mass  coordinate  with  units  of  mass  per  area.  The  two  are  related  by 

dg  =  dX  ( 1 ) 

where  p  is  the  mass  density  as  a  function  of  X  and  the  zero  superscript  indi¬ 
cates  D  is  evaluated  at  time  t  »  0.  Let  V  be  the  specific  volume  such  that 

v-i.  !2) 

This  choice  of  coordinates  guarantees  conservation  of  mass. 

Assuming  rectangular  symmetry  (slab  geometry),  the  remaining  conservation 
laws  in  differential  form  are 


•  Conservation  of  volume: 


Conservation  of  momentum; 


aV  .  Su 

at  dg 


•  Conservation  of  energy: 


(p  .,)]  (5) 

where  u  is  the  velocity  (or  specific  momentum)  of  a  fixed  point  in  the  mass,  p 
is  the  pressure,  q  is  the  material  viscosity,  and  E  is  the  total  specific 
energy.  In  vector  notation  this  system  can  be  written  as 


where 


au  ,  aF(U) 
at  ag 

u  •  (V.  u,  E)^ 


and  . 

F(U)  ■  (u,  -(p  ♦  q),  -u(p  ♦  q))'. 

As  usual,  the  superscript  T  denotes  the  transpose  of  the  array. 
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If  c  is  tue  specific  internal  energy,  then 


E  »  e 


1  2 


(9) 


Taking  the  partial  derivative  of  equation  (9)  with  respect  to  ti^ne  and  substi¬ 
tuting  equations  (3),  (4),  and  (S)  into  the  result  yields  an  internal  energy 
equation 

It  It 

in  place  of  equation  (5).  This  system  of  equations  (3),  (4),  and  (S)  or  (10) 
is  incomplete  without  an  equation  for  the  pressure.  The  system  is  closed  with 
a  thermomechani cal  equation  of  state  (EOS)  relating  p  to  V  and  c. 


Of  the  two  most  common  finite  differencing  schemes,  we  have  chosen  the 
one  attributable  to  von  Neumann-ilichtmyer^^  over  that  of  Lax-Wendrof f .  It 
appears  to  have  good  accuracy  and  less  tendency  to  oscillate  in  response  to 
strong  shocks  and  rarefactions.^®  For  the  discretization  of  the  time  and  mass 
per  area  independent  variables,  we  adopt  the  usual  convention  by  which  a 
superscript  denotes  time  dependence  and  a  subscript  indicates  the  location  in 
the  mass  variable.  With  this  notation,  the  half  integers  denote  time  and  mass 
centering  in  the  mesh.  Thus,  the  von  Neumann-Richtmyer  discretization  scheme 
for  the  conservation  laws  becomes^ 

•  Conservation  of  volume: 


.19 


„n+l  „n 

-  t"  ‘'j  +  1  ■  "j 


.n>l 


'll) 


•  Conservation  of  momentum: 

n^l/2  n-1/2 

u.  -  u^ 


J  „  J  .  -  . 
^n>l/2  ,  ^n-l/i 


r  ” 

'  P  . 


J->-l/2 


0-1/2, 

*^j»l/2' 


n 

Pj-1/2  *^3-1/2^ 


■j+l/2  ■  “j-1/2 


(12) 


•  Energy  equation: 


j+1/2 


f  1  rn"* 

I7  (Pj  + 


n+1  n 
1/2  ‘’j^l/2 


^  ^  ‘^j^l/2^^'^Jtl/2  ■  ''j^l/2^  * 


The  EOS  expresses  Pj+i/2  Terms  of  '^j+i/2  'j+1/2* 

When  real  viscous  effects  are  negligible  or,  at  least,  insufficient  to 
handle  severe  gradients  in  the  physical  properties  of  the  material,  an 


artificial  viscosity  is  sjpprimposed  to  average  the  abrupt  variatiuns  over 
several  adjacent  zones.  A  well  vnown  implementat i on  of  artificial  viscosity 
is  given  by  a  fomiula  due  to  von  Neumann  and  Ricntmyer  as  modified  by 
Rosenbluth^^  and  Landshoff.^^ 

fhe  superscripts  in  the  difference  equations  (11),  (12),  and  (13)  indi¬ 
cate  the  order  in  which  they  are  solved.  First,  the  velocity  u  is  updated 
from  time  to  time  using  equation  (12)  where,  in  practice,  it  has 

not  been  found  necessary  to  extrapolate  the  viscosity  q  to  time  t^.  Then,  in 
a  leapfrog  fashion,  equation  (11)  is  employed  to  vault  the  specific  volume  V 
over  the  velocity  from  time  t”  to  time  t”'*’^.  Finally,  all  that  remains  is  the 
energy  equation  in  which  compression  and  viscous  work  contribute  to  the 
heating  of  the  material. 

If  the  equation  of  state  has  a  tractable  analytic  expression  for  the 
pressure,  it  can  be  substituted  into  the  energy  equation  (13).  Then,  upon 
rearranging  terms,  the  internal  energy  e  is  advanced  from  time  t'’  to 
One  such  form  is  the  ideal  gas  law 

P  »  (y  -  1)  e  0  (14) 

where  r  is  the  ratio  of  specific  heats  cy  to  cj.  A  slight  generalization  of 
this  is  the  Mi e-Gruneisen  law 

P  *  f(D)  ♦rep  (15) 

where  r  >  0  is  the  Gruneisen  parameter.  This  form  is  often  used  in  research 
involving  shock  waves  in  solids.  For  both  equations  of  state  (14)  and  (15), 
it  is  easy  to  reduce  the  von  Neumann -Richtmyer  implicit  discretization  of  the 
internal  energy  equation  (13)  to  an  explicit  expression  for  IT  the 

EOS  is  purely  mechanical,  meaning  p  depends  only  on  V,  then  both  the  energy 
equation  (13)  and  the  EOS  are  not  needed  to  complete  the  system  and  they  may 
be  omitted  altogether. 

One  of  the  advantages  of  the  parallel  computer  (MIMD  architecture)  over 
the  vector  computer  (SIMO  architecture)  pertains  to  the  parallelization  of  the 
material  law  routine.  These  calculations  do  not  in  general  vectorize.  He 
have  observed  that,  in  certain  cases,  a  large  portion  (often  over  751)  of  the 
processing  time  is  spent  on  computationally  intensive  material  laws. 

More  complicated  situations  occur  when  the  EOS  is  not  as  straightforward 
as  equation  (14)  or  equation  (15)  and  when  energy  transport  mechanisms  such  as 
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conduction  become  important.  In  such  cases,  equation  (13)  with  additional 
terms  for  energy  transport  mechanisms  must  be  solved  implicitly.  A  discus¬ 
sion  of  parallel  algorithms  for  implicit  equations  is  contained  in  the  follow¬ 
ing  paragraphs.  More  general  energy  equations  are  beyond  the  scope  of  this 
article.  An  examination  of  these  problems  is  a  natural  extension  that  the 
authors  would  like  to  promote. 

The  time  and  space  structure  of  the  data  as  it  has  just  been  described 
leaves  it  in  a  form  that  is  Immediately  amenable  for  parallelization.  The 
data  structure  for  the  conservation  of  momentum  is  illustrated  in  Figure  2 
while  the  conservation  of  volume  is  shown  in  Figure  3.  The  discrete  conserva¬ 
tion  of  momentum  equation  (12)  has  the  form 

u  ’'U,.-  o..^-o,f^  r  (16) 

new  old  right  left 

where  r  =  it/iu  and  o  *  p  ♦  q.  Similarly,  the  discrete  conservation  of  volume 
equation  (11)  has  the  form 


f 


new 


old 


(u 


right 


■  "^left^ 


;w) 


figures  2  and  3  also  show  the  offsets  in  both  time  and  space  (mass)  between 
the  momentxi  and  mass  equations  as  specified  by  the  von  Neumann-Ri chtmyer 
differencing  scheme. 

"'■q  advance  the  momentum  in  time,  we  construct  a  subroutine  to  evaluate 
equation  (16).  The  main  program  CREATES  the  optimal  number  of  copies  of  this 
routine  and  uses  the  fork-join  structure  of  parallel  programming  to  advance 
the  velocities  simultaneously.  Asynchronous  variable'  passed  between  the  main 
program  and  the  copies  indicate  when  the  main  stream  can  continue.  Then,  in 
similar  fork -paral lei -join  fashion,  the  new  velocities  are  used  to  update  the 
zone  boundary  locations  from  which  new  volumes  are  used  to  calculate  the  new 
energies.  These  kinds  of  programming  tasks  are  described  further  in  refer¬ 
ences  [3,  page  43]  and  [15,  page  90].  Finally,  the  explicit  equation  of  state 
evaluations  are  done  with  the  same  type  of  zone-by-zone  or  block-by-block 
parallel  programming  structure. 
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Figure  2.  The  data  structure  for  the  Figure  3.  The  data  structure  for 
conservation  of  momentum  has  an  explicit  the  conservation  of  volume  is  also 
form  as  stated  by  equation  (16).  explicit  as  stated  by  equation  (17). 

Ii|>11c1t  discretl/ation. 

The  second  stage  of  the  research  plan  is  the  paral lei ization  of  an 
implicit  scheme.  The  object  of  implicit  calculations  for  a  hydrocode  is  the 
removal  of  restrictions  on  the  time  step.  Explicit  time  steps  are  usually 
constrained  by  the  CFL  (Courant,  Friedrichs,  Lewy)^^*^^  condition  that 
is  required  for  stability.  Various  necessary  and  sufficient  conditions  on  the 
CFL  number  in  hydrocodes  with  classical  thermomechanical  equations  of  state 
such  as  (14)  or  (15)  are  contained  in  Richtmyer  and  Morton^^  while  some  recent 
results  for  rate  dependent  and  related  computational  techniques  such  as  Sub- 

cycling  have  been  developed  by  Miens. For  a  general  discussion  of 

hydrocode  convergence  problems,  see  a  couple  o'  the  references  by  Miens. 

The  explicit  solution  advances  the  components  of  equation  (6)  from  time 
tn-1/2  jg  ^n>l/2  time  t'’  to  time  t'’*^  in  terms  of  quantities 

evaluated  at  times  t'’  or  respectively.  Such  a  centered  difference 

scheme  yields  second  order  accuracy.  However,  the  time  centered  quantities 
have  to  be  evaluated  at  the  forward  times  or  t”'*'^,  respectively,  in 

order  for  the  unconditional  stability  of  an  implicit  solution  to  be 

achieved.  A  mixture  of  the  two,  that  is,  a  weighted  average  of  the  explicit 

and  the  implicit  terms  can  achieve  the  stability  of  the  implicit  methods  while 
retaining  some  of  the  accuracy  of  the  explicit  scheme.^ 

The  Implicit  solution  of  coupled  differential  equations  typically  leads 
to  a  tridlagonal  linear  system  for  which  a  parallel  algorithm  is  not 
immediately  obvious.  It  Is  possible  to  solve  the  equations  in  parallel  via  an 
a  priori,  symbolic  inversion  of  the  system.^  This  Involves  the  evaluation  of 
several  recursive  sequences.  The  method  divides  a  recursive  sequence  Into 
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parallel  processes  by  the  concurrent  evaluation  of  even  and  odd  terms  (or, 
more  generally,  by  the  concurrent  evaluation  of  the  n  sequences  of  terms  whose 
indices  are  equivalent  modulo  n).  Work  on  parallel  algorithms  for  the 
solution  of  a  tridiagonal  system  will  be  discussed  in  a  future  report.  Alter¬ 
natively,  for  hydrocode  calculations,  the  original  system  of  differential 
equations  can  be  divided  into  uncoupled  tridiagonal  systems  by  Inserting 
boundaries  along  which  the  values  of  the  unknowns  are  determined  by  a  stable 
explicit  scheme.  Several  explicit  steps  may  be  required  for  each  implicit 
integration  step. 

For  the  following  discussion  of  the  details,  assume  that  a  temporal  seam 
of  boundary  conditions  is  to  be  calculated  at  the  mass  mesh  point  wj.  Let  the 
implicit  integration  step  at  time  t^  be  at.  Suppose  that  the  CFL  stability 
condition  for  the  von  Neumann-Ri chtmyer  explicit  integration  requires  three 
time  steps  to  advance  the  dependent  variables  from  t'’  to  t'^  +  at.  Let 

«t  -  ^  at  (18) 

and  use  equation  (12)  to  advance  +  ^Fom  time  to  time 

5t  for  i  «  -1,  0,  1  as  shown  in  Figure  4.  Then,  since  the  pressures  are 
xnown  in  the  appropriate  zones  up  through  time  t'',  the  velocities  can  be 
advanced  from  time  ♦  5t  to  time  2  5t,  At  this  point, 

equations  (11)  and  (13)  have  to  be  integrated  in  a  similar  manner  from  time  t^ 
to  time  t^  ♦  6t  in  the  surrounding  zones.  Finally,  the  velocity  is  integrated 
from  time  ♦  2  at  to  time  ♦  3  5t  »  to  achieve  on 

the  internal  boundary  seam.  Figure  4  illustrates  the  domain  of  dependence 
for  The  velocities,  densities,  pressures,  and  energies  are  evaluated 

explicitly  within  this  domain. 

Section  IV  -  Eulerlan  Coordinates 

Once  a  Lagrangian  hydrocode  has  been  constructed,  it  can  be  converted  to 
an  Eulerlan  code  by  making  use  of  the  Harlow -Johnson  rezoning  method.^  Figure 
5  illustrates  the  technique.  An  Eulerlan  calculation  is  achieved  in  two 
steps.  The  first  is  a  Lagrangian  calculation  and  the  second  Is  a  rezoning  of 
the  mesh  back  to  Its  original  location.  Since  the  rezoning  leaves  the  mesh 
points  fixed  in  space,  the  calculation  Is  Eulerlan.  Another  way  of  viewing 
this  rezoning  scheme  is  from  the  point  of  view  of  operator  splitting.  That 
Is,  the  discretization  of  the  Eulerlan  operator  )/)t  *  ua/)x  Is  split  Into  the 
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advaftce  of  the  Lagrangian  part  (3/3t)  followed  by  the  advance  of  the  convec¬ 
tive  part  (u3/3x) . 


Boundary  seam 

t 


Figure  4.  The  domain  of  dependence  for  the  velocities  has  a  half-width  of 
just  one  zone  when  only  thre*  expl Icit  time  steps  5t  are  required  to  integrate 
the  equations  from  time  to  time 

In  the  Lagrangian  advance  the  zone  boundaries  move  from  Xj  to  for 
each  0  <  j  <  J  ♦  1.  The  top  part  of  Figure  5  shows  the  mass  moving  to  the 
right.  Then,  from  the  bottom  part  of  the  figure,  we  see  that  the  rezoning 
step  transfers  mass  from  zone  Cxj.i.  *j]  to  zone  [xj,  Similarly, 

momentum  and  energy  are  transferred  from  one  zone  to  the  next.  Careful  book¬ 
keeping  must  be  maintained  to  ensure  conservation  of  these  quantities. 

Implicit  here  Is  the  assumption  that  zone  boundaries  do  not  travel 
further  than  a  zone  width.  That  Is, 


If  this  constraint  Is  violated  then  the  re/oning  gets  a  bit  more  complica 
ted.  The  inequality  (19)  Is  equivalent  to 


x"  <  x"  ♦  At  X  x"  ,  . 

J*1  J  j  J+l 


(20) 


Thus,  if  ^or  example,  then  inequality  (20)  reduces  to 


.  h>l/2  .  .h  .  h 

Uj  At  <  X.  . 


(21) 


Constraints  of  this  form  are  reminiscent  of  the  CFL  restriction  and  are  some 
times  referred  to  as  the  material -Courant  or  the  mass-f 1 ow-Courant  restric¬ 
tions. 

This  rezoning  algorithm  automatically  conserves  volume  and  mass. 
Momentum,  Internal  energy,  and  kinetic  energy  are  each  conserved  individually 
according  to  a  step  function  or  a  linear  interpolation  for  the  accumulated 
mass  per  area.  The  derivation  of  such  models  can  be  found  in  Hicks  and 
McGrath, These  lead  to  recursive  linear  equations  for  the  dependent 
variables  for  which  parallel  algorithms  are  under  development.  For  the 
present,  the  rezoning  calculations  are  performed  within  the  block  Implicit 
framework  of  the  previous  section. 

Section  VI  -  Timing  Results  on  the  HEP 

Computer  codes  have  been  written  for  each  of  the  four  hydrodynamic  simu¬ 
lations:  explicit  and  implicit,  Lagrangian  and  Eulerian  finite  differences. 

To  demonstrate  the  Improvements  in  computational  efficiency,  the  results  are 
discussed  in  this  section  for  the  explicit  Lagrangian  hydrocode  with  a  linear 
pressure-volume  material  law.  The  simple  test  problem  for  which  the  exact 
solution  is  known  involves  shock  and  rarefaction  waves.  The  initial  condi¬ 
tions  are  an  ideal  pressure  and  density  shock  in  the  interior  of  a  motionless 
slab.  The  boundaries  are  held  fixed  and  the  shock  moves  forward  with  a  rare- 
faction  proceeding  In  the  opposite  direction, 

For  the  explicit.  Lagrangian  equations  (12)  and  (13),  self-scheduling 
processes  were  written  and  implemented  as  follows 
Fork,  compute  viscosities.  Join; 

Fork,  compute  velocities,  join; 

Fork,  compute  volumes.  Join;  and 
Fork,  cotiv>ute  pressures.  Join. 
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For  the  ione-by-zone  algorithm,  the  self-scheduling  within  each  fork  and  join 
means  that  the  processes  advance  a  single  zone  and  then  ask  for  another  zone 
until  none  remain.  For  the  block -by-block  algorithm,  between  each  fork  and 
join,  a  block  of  Ng  contiguous  zones  is  handed  over  to  each  of  the  Np  parallel 
processes.  The  algorithm  is  pre-scheduled  by  setting  NpNg  equal  to  the  total 
number  of  zones. 
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Figure  S.  Harlow's  method  rezones  the  mesh  back  to  its  previous  location 
after  each  Lagran?ian  time  step. 


The  zone-by-zone  algorithm  was  run  on  21  zones  for  4000  integration 
steps.  Figure  6  shows  the  results  over  the  range  of  one  (a  serial  computer) 
to  21  processes.  The  results  for  the  block -by -block  algorithm  are  shown  in 
Figure  7  for  the  same  variation  in  the  number  of  processes.  It  was  demon¬ 
strated  on  4200  zones  for  1000  Integration  steps. 

The  data  shovn  on  Figure  6  can  be  interpreted  according  to  the  following 
model  suggested  by  discussions  in  articles  by  Buzbee,^  Larson, and 
Flatt.^^  Consider  a  computer  program  that  consumes  a  total  processing  time  of 


T  ♦  T 
s  P 


(23) 


where  Tp  is  the  total  processing  time  spent  on  calculations  which  can  be 


divided  into  parallel  processes.  T,  is  the  processing  time  required  for  the 
calculations  that  are  performed  serially  when  the  program  is  ekecuted  in 
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either  serial  or  parallel  mode.  Then,  ijp  until  the  point  at  which  the  hard¬ 
ware  is  saturated,  the  processing  time  for  n  parallel  processes  is  jpproxi- 
mated  by 


T(n)  «  T  ♦It  ♦  k  T  (n) 
'  '  s  n  p  0 


>'24) 


where  T^(n)  is  the  CREATE  overhead  for  n  parallel  processes  and  K  is  the 
number  of  times  that  forking  and  joining  occur  in  the  execution  sequence. 
Thus,  the  last  term  accounts  for  the  parallel  processing  overhead,  that  is, 
the  computer  time  lost  to  the  creation  of  the  parallel  processes  and  to  the 
communication  between  them. 


Processing 

time 

(seconds) 


Figure  6.  Parallel  processing  time  achieves  a  minimum  and  then  increases 
linearly  wit.  the  number  of  processes  for  the  tone-by-zone  algorithm. 


A  plot  of  T(n)  versus  n  yields  a  curve  qualitatively  similar  to  that 
displayed  In  Figure  6.  In  this  case,  the  harc^are  Is  saturated  at  about  six 
processes  at  which  point  the  term  1/n  Tp  In  equation  (24)  is  replaced  by  a 
constant.  Thus,  the  effect  of  the  term  K  1^(0)  Is  a  linear  Increase  In  proc¬ 
essing  time.  The  zone-by-zone  calculations  were  done  with  too  fine  a  granu¬ 
larity.  That  is,  the  computational  chunks  were  too  small.  We  can  Increase 
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their  size  by  going  to  blocks  of  zones  instead  of  advancing  a  single  zone  at  a 
time.  This,  of  course,  leads  to  the  block*by -block  parallel  structure. 

The  results  of  the  block-by-block  calculations  as  shown  in  Figure  7 
indicate  that  the  speed-up  factor  peaks  at  a  value  just  under  nine  somewhere 
in  the  range  of  10-14  processes.  The  speed-up  factor  approaches  this  value 
because  there  are  essentially  nine  segments  in  the  calculations:  eight  in  the 
pipeline  plus  the  store  operation.  This  peaking  phenomena  in  the  range  of  10- 
14  processes  has  been  observed  by  several  other  investigators. 


figure  7.  The  block -by-block  algorithm  achieves  a  speed-up  factor  of  nearly 
nine  at  10  processes  before  starting  to  tail  off. 


Our  objective  is  the  maximization  of  the  speed-up  factor^^ 

T 

S(n)  -  -L-.  .  - 

T(n)  (n-l)Tj  n  1C  T^(n)  (25) 

where  *  ^ 

-  T(l)  (26) 

from  equation  (24).  Thus,  for  S(n)  and  the  effidency^^ 

E(n)  .  ls(n) 


(27) 


to  be  high,  the  denominator  of  equation  (25)  has  to  be  small.  This  means 
that,  for  a  fixed  number  of  processes  n,  a  quantity  limited  by  the  hardware, 
we  need  small  values  of  T^,  K,  and  1^(0). 

The  parallel  overhead  Tg(n)  is  a  machine-dependent  quantity.  It  appears 
to  be  proportional  to  n  as  shown  above.  While  this  is  true  of  other 
machines, it  can  also  be  proportional  to  log  n.^^  In  any  case,  the  values 
of  Tj  and  K  are  subject  to  the  effectiveness  of  the  parallel  algorithms. 

Thus,  for  a  given  machine.  It  Is  important  to  reduce  T^  and  K  as  much  as 
possible  In  order  to  achieve  the  highest  speed-up  factor  and  the  greatest 
efficiency.  One  final  remark  concerns  the  trade-off  between  the  CREATE 
statement  and  the  asynchronous  variables.  The  value  of  K  can  be  altered  by 
reducing  the  number  of  CREATES.  The  greatest  effect  can  be  achieved  by 
eliminating  the  repeated  fork -paral lei -join  approach  employed  for  both  the 
zone-by-zone  and  the  block-by-block  algorithms.  To  accomplish  this,  the 
calculations  for  viscosities,  velocities,  volumes,  and  pressures  were 
Incorporated  into  a  single  subroutine.  The  program  forks  Into  multiple  copies 
of  this  routine  at  the  beginning  of  the  simulation.  Then  each  process  in 
either  zone-by-zone  or  block -by -block  fashion  performs  the  complete  set  of 
calculations  for  that  time  step.  At  the  end  of  the  time  step,  synchronization 
Is  achieved  through  an  asynchronous  variable,  but  the  process  is  not 
terminated  by  returning  to  the  mainstream.  Instead,  all  processes  continue 
with  the  next  time  step.  They  do  not  join  until  the  Integration  is  complete. 

Of  course,  the  synchronization  step  constitutes  a  "join"  In  the  processing 
to  be  followed  by  another  "fork".  Thus,  this  approach  did  not  provide  the 
expected  Improvement  In  the  speed-up  factor.  Evidently,  the  memory  contention 
for  the  asynchronous  variable  outweighs  the  savings  In  CREATE  overhead.  For 
these  reasons,  we  believe  that  Figure  7  illustrates  close  to  the  optimum 
speed-up  achievable  for  the  explicit,  Lagrangian  calculations. 
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